{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Huggingface Sagemaker-sdk - Distributed Training Demo\n",
    "### Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Tutorial](#Tutorial)  \n",
    "2. [Set up a development environment and install sagemaker](#Set-up-a-development-environment-and-install-sagemaker)\n",
    "    1. [Installation](#Installation)  \n",
    "    2. [Development environment](#Development-environment)  \n",
    "    3. [Permissions](#Permissions) \n",
    "4. [Choose 🤗 Transformers `examples/` script](#Choose-%F0%9F%A4%97-Transformers-examples/-script)  \n",
    "1. [Configure distributed training and hyperparameters](#Configure-distributed-training-and-hyperparameters)  \n",
    "2. [Create a `HuggingFace` estimator and start training](#Create-a-HuggingFace-estimator-and-start-training)   \n",
    "3. [Upload the fine-tuned model to huggingface.co](#Upload-the-fine-tuned-model-to-huggingface.co)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial\n",
    "\n",
    "We will use the new [Hugging Face DLCs](https://github.com/aws/deep-learning-containers/tree/master/huggingface) and [Amazon SageMaker extension](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator) to train a distributed Seq2Seq-transformer model on `summarization` using the `transformers` and `datasets` libraries and upload it afterwards to [huggingface.co](http://huggingface.co) and test it.\n",
    "\n",
    "As [distributed training strategy](https://huggingface.co/transformers/sagemaker.html#distributed-training-data-parallel) we are going to use [SageMaker Data Parallelism](https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/), which has been built into the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) API. To use data-parallelism we only have to define the `distribution` parameter in our `HuggingFace` estimator.\n",
    "\n",
    "```python\n",
    "# configuration for running training on smdistributed Data Parallel\n",
    "distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}\n",
    "```\n",
    "\n",
    "In this tutorial, we will use an Amazon SageMaker Notebook Instance for running our training job. You can learn [here how to set up a Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).\n",
    "\n",
    "**What are we going to do:**\n",
    "\n",
    "- Set up a development environment and install sagemaker\n",
    "- Chose 🤗 Transformers `examples/` script\n",
    "- Configure distributed training and hyperparameters\n",
    "- Create a `HuggingFace` estimator and start training\n",
    "- Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n",
    "- Test inference\n",
    "\n",
    "### Model and Dataset\n",
    "\n",
    "We are going to fine-tune [facebook/bart-base](https://huggingface.co/facebook/bart-base) on the [samsum](https://huggingface.co/datasets/samsum) dataset. *\"BART is sequence-to-sequence model trained with denoising as pretraining objective.\"* [[REF](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md)]\n",
    "\n",
    "The `samsum` dataset contains about 16k messenger-like conversations with summaries. \n",
    "\n",
    "```python\n",
    "{'id': '13818513',\n",
    " 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',\n",
    " 'dialogue': \"Amanda: I baked cookies. Do you want some?\\r\\nJerry: Sure!\\r\\nAmanda: I'll bring you tomorrow :-)\"}\n",
    "```\n",
    "\n",
    "_**NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances**_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Set up a development environment and install sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "_**Note:** The use of Jupyter is optional: We could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud and appropriate permissions, such as a Laptop, another IDE or a task scheduler like Airflow or AWS Step Functions._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install \"sagemaker>=2.48.0\" \"transformers==4.6.1\" \"datasets[s3]==1.6.2\" --upgrade -i https://opentuna.cn/pypi/web/simple\n",
    "#!apt install git-lfs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash\n",
    "#!sudo yum install git-lfs -y\n",
    "#!git lfs install"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Development environment "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**upgrade ipywidgets for `datasets` library and restart kernel, only needed when prerpocessing is done in the notebook**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import IPython\n",
    "#!conda install -c conda-forge ipywidgets -y\n",
    "#IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker.huggingface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Permissions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "print(f\"IAM role arn used for running training: {role}\")\n",
    "print(f\"S3 bucket used for storing artifacts: {sess.default_bucket()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose 🤗 Transformers `examples/` script\n",
    "\n",
    "The [🤗 Transformers repository](https://github.com/huggingface/transformers/tree/master/examples) contains several `examples/`scripts for fine-tuning models on tasks from `language-modeling` to `token-classification`. In our case, we are using the `run_summarization.py` from the `seq2seq/` examples. \n",
    "\n",
    "_**Note**: you can use this tutorial identical to train your model on a different examples script._\n",
    "\n",
    "Since the `HuggingFace` Estimator has git support built-in, we can specify a [training script that is stored in a GitHub repository](https://sagemaker.readthedocs.io/en/stable/overview.html#use-scripts-stored-in-a-git-repository) as `entry_point` and `source_dir`.\n",
    "\n",
    "We are going to use the `transformers 4.4.2` DLC which means we need to configure the `v4.4.2` as the branch to pull the compatible example scripts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "git_config = {'repo': 'https://gitee.com/whn09/transformers.git','branch': 'v4.6.1.1'} # v4.6.1 is referring to the `transformers_version` you use in the estimator."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure distributed training and hyperparameters\n",
    "\n",
    "Next, we will define our `hyperparameters` and configure our distributed training strategy. As hyperparameter, we can define any [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#seq2seqtrainingarguments) and the ones defined in [run_summarization.py](https://github.com/huggingface/transformers/tree/master/examples/seq2seq#sequence-to-sequence-training-and-evaluation). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hyperparameters, which are passed into the training job\n",
    "hyperparameters={'per_device_train_batch_size': 1,\n",
    "                 'per_device_eval_batch_size': 1,\n",
    "                 'model_name_or_path': 'facebook/bart-large-cnn',\n",
    "                 'dataset_name': 'samsum',\n",
    "                 'do_train': True,\n",
    "                 'do_eval': True,\n",
    "                 'do_predict': True,\n",
    "                 'predict_with_generate': True,\n",
    "                 'output_dir': '/opt/ml/model',\n",
    "                 'num_train_epochs': 1,\n",
    "                 'learning_rate': 5e-5,\n",
    "                 'seed': 7,\n",
    "                 'save_total_limit':3,\n",
    "                 'fp16': False,\n",
    "                 }\n",
    "\n",
    "# configuration for running training on smdistributed Data Parallel\n",
    "#distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a `HuggingFace` estimator and start training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.huggingface import HuggingFace\n",
    "\n",
    "# create the Estimator\n",
    "huggingface_estimator = HuggingFace(\n",
    "      entry_point='run_summarization.py', # script\n",
    "      source_dir='./examples/pytorch/summarization', # relative path to example\n",
    "      git_config=git_config,\n",
    "      instance_type='ml.p3.2xlarge',\n",
    "      instance_count=1,\n",
    "      transformers_version='4.6',\n",
    "      pytorch_version='1.7',\n",
    "      py_version='py36',\n",
    "      volume_size=500, #instance memory\n",
    "      role=role,\n",
    "      base_job_name='bart-en',\n",
    "      hyperparameters = hyperparameters\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# starting the train job\n",
    "huggingface_estimator.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploying the endpoint\n",
    "\n",
    "To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = huggingface_estimator.deploy(1,\"ml.g4dn.xlarge\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we use the returned predictor object to call the endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
    "    Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
    "    Jeff: ok.\n",
    "    Jeff: and how can I get started? \n",
    "    Jeff: where can I find documentation? \n",
    "    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face                                           \n",
    "    '''\n",
    "\n",
    "data= {\"inputs\":conversation}\n",
    "\n",
    "predictor.predict(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we delete the endpoint again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n",
    "\n",
    "We can download our model from Amazon S3 and unzip it using the following snippet.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tarfile\n",
    "from sagemaker.s3 import S3Downloader\n",
    "\n",
    "local_path = 'my_bart_model'\n",
    "\n",
    "os.makedirs(local_path, exist_ok = True)\n",
    "\n",
    "# download model from S3\n",
    "S3Downloader.download(\n",
    "    s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located\n",
    "    local_path=local_path, # local path where *.targ.gz is saved\n",
    "    sagemaker_session=sess # sagemaker session used for training the model\n",
    ")\n",
    "\n",
    "# unzip model\n",
    "tar = tarfile.open(f\"{local_path}/model.tar.gz\", \"r:gz\")\n",
    "tar.extractall(path=local_path)\n",
    "tar.close()\n",
    "os.remove(f\"{local_path}/model.tar.gz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we are going to upload our model to [huggingface.co](http://huggingface.co) we need to create a `model_card`. The `model_card` describes the model includes hyperparameters, results and which dataset was used for training. To create a `model_card` we create a `README.md` in our `local_path`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read eval and test results \n",
    "with open(f\"{local_path}/eval_results.json\") as f:\n",
    "    eval_results_raw = json.load(f)\n",
    "    eval_results={}\n",
    "    eval_results[\"eval_rouge1\"] = eval_results_raw[\"eval_rouge1\"]\n",
    "    eval_results[\"eval_rouge2\"] = eval_results_raw[\"eval_rouge2\"]\n",
    "    eval_results[\"eval_rougeL\"] = eval_results_raw[\"eval_rougeL\"]\n",
    "    eval_results[\"eval_rougeLsum\"] = eval_results_raw[\"eval_rougeLsum\"]\n",
    "\n",
    "with open(f\"{local_path}/test_results.json\") as f:\n",
    "    test_results_raw = json.load(f)\n",
    "    test_results={}\n",
    "    test_results[\"test_rouge1\"] = test_results_raw[\"test_rouge1\"]\n",
    "    test_results[\"test_rouge2\"] = test_results_raw[\"test_rouge2\"]\n",
    "    test_results[\"test_rougeL\"] = test_results_raw[\"test_rougeL\"]\n",
    "    test_results[\"test_rougeLsum\"] = test_results_raw[\"test_rougeLsum\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "After we extract all the metrics we want to include we are going to create our `README.md`. Additionally to the automated generation of the results table we add the metrics manually to the `metadata` of our model card under `model-index`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(eval_results)\n",
    "print(test_results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "MODEL_CARD_TEMPLATE = \"\"\"\n",
    "---\n",
    "language: en\n",
    "tags:\n",
    "- sagemaker\n",
    "- bart\n",
    "- summarization\n",
    "license: apache-2.0\n",
    "datasets:\n",
    "- samsum\n",
    "model-index:\n",
    "- name: {model_name}\n",
    "  results:\n",
    "  - task: \n",
    "      name: Abstractive Text Summarization\n",
    "      type: abstractive-text-summarization\n",
    "    dataset:\n",
    "      name: \"SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization\" \n",
    "      type: samsum\n",
    "    metrics:\n",
    "       - name: Validation ROGUE-1\n",
    "         type: rogue-1\n",
    "         value: 42.621\n",
    "       - name: Validation ROGUE-2\n",
    "         type: rogue-2\n",
    "         value: 21.9825\n",
    "       - name: Validation ROGUE-L\n",
    "         type: rogue-l\n",
    "         value: 33.034\n",
    "       - name: Test ROGUE-1\n",
    "         type: rogue-1\n",
    "         value: 41.3174\n",
    "       - name: Test ROGUE-2\n",
    "         type: rogue-2\n",
    "         value: 20.8716\n",
    "       - name: Test ROGUE-L\n",
    "         type: rogue-l\n",
    "         value: 32.1337\n",
    "widget:\n",
    "- text: | \n",
    "    Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
    "    Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
    "    Jeff: ok.\n",
    "    Jeff: and how can I get started? \n",
    "    Jeff: where can I find documentation? \n",
    "    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n",
    "---\n",
    "## `{model_name}`\n",
    "This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.\n",
    "For more information look at:\n",
    "- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)\n",
    "- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)\n",
    "- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)\n",
    "- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)\n",
    "- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)\n",
    "## Hyperparameters\n",
    "    {hyperparameters}\n",
    "## Usage\n",
    "    from transformers import pipeline\n",
    "    summarizer = pipeline(\"summarization\", model=\"philschmid/{model_name}\")\n",
    "    conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
    "    Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
    "    Jeff: ok.\n",
    "    Jeff: and how can I get started? \n",
    "    Jeff: where can I find documentation? \n",
    "    Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face                                           \n",
    "    '''\n",
    "    nlp(conversation)\n",
    "## Results\n",
    "| key | value |\n",
    "| --- | ----- |\n",
    "{eval_table}\n",
    "{test_table}\n",
    "\"\"\"\n",
    "\n",
    "# Generate model card (todo: add more data from Trainer)\n",
    "model_card = MODEL_CARD_TEMPLATE.format(\n",
    "    model_name=f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\",\n",
    "    hyperparameters=json.dumps(hyperparameters, indent=4, sort_keys=True),\n",
    "    eval_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in eval_results.items()),\n",
    "    test_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in test_results.items()),\n",
    ")\n",
    "with open(f\"{local_path}/README.md\", \"w\") as f:\n",
    "    f.write(model_card)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have our unzipped model and model card located in `my_bart_model` we can use the either `huggingface_hub` SDK to create a repository and upload it to [huggingface.co](http://huggingface.co) or go to https://huggingface.co/new an create a new repository and upload it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from getpass import getpass\n",
    "from huggingface_hub import HfApi, Repository\n",
    "\n",
    "hf_username = \"philschmid\" # your username on huggingface.co\n",
    "hf_email = \"philipp@huggingface.co\" # email used for commit\n",
    "repository_name = f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\" # repository name on huggingface.co\n",
    "password = getpass(\"Enter your password:\") # creates a prompt for entering password\n",
    "\n",
    "# get hf token\n",
    "token = HfApi().login(username=hf_username, password=password)\n",
    "\n",
    "# create repository\n",
    "repo_url = HfApi().create_repo(token=token, name=repository_name, exist_ok=True)\n",
    "\n",
    "# create a Repository instance\n",
    "model_repo = Repository(use_auth_token=token,\n",
    "                        clone_from=repo_url,\n",
    "                        local_dir=local_path,\n",
    "                        git_user=hf_username,\n",
    "                        git_email=hf_email)\n",
    "\n",
    "# push model to the hub\n",
    "model_repo.push_to_hub()\n",
    "\n",
    "print(f\"https://huggingface.co/{hf_username}/{repository_name}\")"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "interpreter": {
   "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
  },
  "kernelspec": {
   "display_name": "conda_pytorch_latest_p36",
   "language": "python",
   "name": "conda_pytorch_latest_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
